
OFFICIAL SENSITIVE
This template demonstrates how to develop and share interactive reproducible analysis using Python and Jupyter Lab.
The data used in this template is for learning purposes only.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import holoviews as hv
from holoviews import opts
import hvplot.pandas
import datetime
import re
import locale
from locale import atof
pd.options.plotting.backend = 'holoviews'
The data sources used are fictional exam scores from Kaggle and datasets from ONS on 18-24 student population over time and DfE on 19+ student populations by Local Authority. The datasets can be found at Kaggle and ONS, ONS GeoPortal and GOV.UK
Source: Kaggle
exam_performance = pd.read_csv("data/students-performance.csv")
exam_performance.head(3)
| gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | some college | free/reduced | completed | 71 | 85 | 83 |
| 1 | male | group B | some high school | standard | none | 41 | 40 | 34 |
| 2 | male | group C | high school | standard | none | 66 | 52 | 54 |
The exam performance data comprises 5,000 records (rows) and 8 features (columns).
exam_performance.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 gender 5000 non-null object 1 race/ethnicity 5000 non-null object 2 parental level of education 5000 non-null object 3 lunch 5000 non-null object 4 test preparation course 5000 non-null object 5 math score 5000 non-null int64 6 reading score 5000 non-null int64 7 writing score 5000 non-null int64 dtypes: int64(3), object(5) memory usage: 312.6+ KB
Source: ONS
students = pd.read_csv("data/total-students.csv")
students.head(3)
| date | aged 18 to 24 in full-time education (thousands) | |
|---|---|---|
| 0 | Mar-May 1992 | 984 |
| 1 | Apr-Jun 1992 | 999 |
| 2 | May-Jul 1992 | 1,012 |
students.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 291 entries, 0 to 290 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 291 non-null object 1 aged 18 to 24 in full-time education (thousands) 291 non-null object dtypes: object(2) memory usage: 4.7+ KB
Data Source: GOV.UK
Boundaries Shapefile Source: ONS GeoPortal
# TO DO
Transform the 18 to 24 student population from string 1,000 format to decimal 1000.00 using ASCII to float (atof).
locale.setlocale(locale.LC_NUMERIC, '')
students[["aged 18 to 24 in full-time education (thousands)"]].applymap(atof)
students["aged 18 to 24 in full-time education (thousands)"] = students["aged 18 to 24 in full-time education (thousands)"].map(atof)
students["aged 18 to 24 in full-time education"] = students["aged 18 to 24 in full-time education (thousands)"] * 1000
Add year and month columns to 18-24 student population data.
def add_year_column_to_students():
search = []
for values in students["date"]:
search.append(re.search(r'(\d{4})', values).group())
students["year"] = search
students["year"] = pd.to_datetime(students["year"]).dt.year
add_year_column_to_students()
def add_month_column_to_students():
students["month"] = students["date"].str.split(" ")[0][0]
add_month_column_to_students()
students.head(3)
| date | aged 18 to 24 in full-time education (thousands) | aged 18 to 24 in full-time education | year | month | |
|---|---|---|---|---|---|
| 0 | Mar-May 1992 | 984.0 | 984000.0 | 1992 | Mar-May |
| 1 | Apr-Jun 1992 | 999.0 | 999000.0 | 1992 | Mar-May |
| 2 | May-Jul 1992 | 1012.0 | 1012000.0 | 1992 | Mar-May |
exam_performance.describe()
| math score | reading score | writing score | |
|---|---|---|---|
| count | 5000.000000 | 5000.000000 | 5000.000000 |
| mean | 66.906600 | 69.791600 | 68.510800 |
| std | 15.136819 | 14.486302 | 15.216524 |
| min | 13.000000 | 19.000000 | 17.000000 |
| 25% | 57.000000 | 60.000000 | 58.000000 |
| 50% | 67.000000 | 70.000000 | 69.000000 |
| 75% | 77.000000 | 80.000000 | 79.000000 |
| max | 100.000000 | 100.000000 | 100.000000 |
exam_performance.hvplot.hist("math score", groupby="lunch", subplots=True, width=400, height=450, dynamic=False) +\
exam_performance.hvplot.hist("reading score", groupby="lunch", subplots=True, width=400, height=450, dynamic=False) +\
exam_performance.hvplot.hist("writing score", groupby="lunch", subplots=True, width=400, height=450, dynamic=False)
exam_performance.hvplot.scatter(x='math score', y='writing score', c="gender", groupby="race/ethnicity", cmap={"male": "#1d70b8", "female": "#f47738"}, legend='top', dynamic=False, height=500, width=1200)
grouped_performance = exam_performance.groupby(['race/ethnicity', 'gender'])['math score'].mean().sort_values()
grouped_performance.hvplot.barh('race/ethnicity', 'math score', by="gender", legend='bottom_right', cmap={"male": "#1d70b8", "female": "#f47738"}, height=500, width=950)
# https://stackoverflow.com/questions/55240460/hvplot-heatmap-with-pandas-dataframe-how-to-specify-value-dimensions
heatmap = exam_performance.corr().hvplot.heatmap(height=500, width=950, colorbar=True, cmap=["#5694ca", "#1daeb8", "#1d8fb8", "#1d70b8", "#003078"], value_label="test")
heatmap * hv.Labels(heatmap)
Students numbers (aged 18-24) have almost doubled since 1992. In 2016, the student population was around 1,896,000
from bokeh.models import HoverTool
hover = HoverTool(tooltips=[("year", "@year{int}"), ("aged 18 to 24 in full-time education (thousands)", "@{aged 18 to 24 in full-time education (thousands)}{int}")])
students.groupby("year").mean().hvplot(x='year', y='aged 18 to 24 in full-time education (thousands)', width=1100, height=400, hover=False) *\
students.groupby("year").mean().hvplot.scatter(x='year', y='aged 18 to 24 in full-time education (thousands)').opts(tools=[hover]) +\
students.groupby("year").mean().astype(int).hvplot.table(["year", "aged 18 to 24 in full-time education"], width=350, height=395)